Abstract

Driven by the opportunity to harvest the flexibility related to building climate control for demand response applications, this work
presents a data-driven control approach building upon recent advancements in reinforcement learning. More specifically,
model-assisted batch reinforcement learning is applied to the setting of building climate control subject to dynamic pricing. The
underlying sequential decision-making problem is cast as a Markov decision process, after which the control algorithm is detailed.
In this work, fitted Q-iteration is used to construct a policy from a batch of experimental tuples. In those regions of the state
space where the experimental sample density is low, virtual support tuples are added using an artificial neural network. Finally, the
resulting policy is shaped using domain knowledge. The control approach has been evaluated quantitatively using a simulation and
qualitatively in a living lab. The quantitative analysis shows that the control approach converges in approximately 20
days to a control policy with a performance within 90% of the mathematical optimum. The experimental analysis confirms
that within 10 to 20 days sensible policies are obtained that can be used for different outside temperature regimes.
Nomenclature

ANN Artificial Neural Network
BRL Batch Reinforcement Learning
ELM Extreme Learning Machine
FQI Fitted Q-Iteration
HVAC Heating, Ventilation and Air Conditioning
MABRL Model-Assisted Batch Reinforcement Learning
MDP Markov Decision Process
MF Membership Function
MPC Model Predictive Control
RL Reinforcement Learning

1. Introduction 

Perez et al. estimate that 20 to 40% of the global energy is
consumed in buildings [?]. About half of this energy is used for
HVAC [?]. As a consequence, control strategies for HVAC have
received considerable academic attention in recent years. A
popular class of control strategies is that of model-based strategies,
such as Model Predictive Control (MPC) [?]. MPC for
HVAC systems has been widely investigated in the recent literature [?],
in both of its main aspects, modelling [?] and control [?].
In MPC, at regular time intervals, a
control action is selected by solving an optimization problem
over a finite time horizon, which is typically a day for HVAC
control. The impact of future disturbances, such as internal
heating and meteorological conditions, is taken into account
using forecasts. Predictive control allows using the load
flexibility related to thermal storage, e.g. through the thermal
inertia of the building or through direct heat storage [?].

This flexibility can be harvested to enable demand response
and provide load control services, the value of which has been
increasing together with the share of renewable energy in the
production mix. Examples of such services are peak shaving and valley
filling for a distribution system operator [?], ancillary services
towards a transmission system operator [?], or energy
arbitrage [?]. However, deploying MPC can be a challenging task.
The most significant challenge is to derive an accurate model
which, in the case of thermal control, has to include the thermal
dynamics and the actuation model. In [?], Siroky et al. give
a detailed report on implementation issues of MPC controllers
for building heating systems.

In this context, completely data-driven approaches are
deemed interesting, sacrificing performance for practicality.
One possible embodiment uses a data-driven model in combination
with an optimization algorithm to obtain a control policy [?].
Alternatively, it is possible to learn the control
policy directly by estimating a state-action value function through
interaction with the system. For example, in [?] Reinforcement
Learning (RL), a model-free control approach, is applied to
building thermal storage. In RL, the policy is updated online,
i.e. at each time step. In Batch Reinforcement Learning (BRL),
on the other hand, the policy is calculated offline using a batch
of historical data. Even though (B)RL is getting more mature [?],
as discussed in [?], combining techniques of RL with
prior (domain) knowledge is a logical control paradigm. It is
in this direction that this paper is positioned, i.e. applying
BRL in combination with prior knowledge to the operation
of a building climate control system for demand response
applications.

The basis of our approach is BRL with Fitted Q-Iteration
(FQI) [?], where the learning of an optimal control policy
is enhanced by virtual data coming from a model. For this reason,
the approach is called Model-Assisted Batch Reinforcement
Learning (MABRL), as discussed in [?].

In Section 2 an overview of the related literature is provided
and the contribution of this work is explained. Following the
approach presented in [?], in Section 3 the building thermal
scheduling is formalised as a sequential decision-making problem
under uncertainty. In Section 4 MABRL is detailed, while
Section 5 presents a quantitative and qualitative assessment of
the performance of the controller. Finally, Section 6 outlines
the conclusions and discusses future research directions.

2. Related Work 

This section gives a non-exhaustive overview of related work
regarding MPC and RL for building climate control, after which
the main contributions of this work are explained.

2.1. Model Predictive Control

When considering building climate control, MPC has received
considerable attention in the recent literature [?].
An overview of practical issues related to the implementation
of an MPC controller can be found in [?]. The key elements
of an MPC comprise mathematical models describing the
building dynamics, comfort requirements and exogenous information
such as user behavior and outdoor temperature. This information
is used to cast an optimization problem that is solved
to define optimal control actions with respect to a defined
objective function, subject to constraints provided by the model.

In typical embodiments of MPC one tries to formalize the
problem as a mixed integer problem to allow using fast solvers
with performance guarantees. Therefore, a Linear Time Invariant
(LTI) model of the system under control is to be identified.
If no domain knowledge is available, black-box identification
techniques are used, such as subspace identification methods [?].
Alternatively, gray-box models can be used, where the
model structure is defined and the parameters are estimated
using experimental data [?]. In the context of thermal modelling,
a number of studies use thermal circuits [30], [31], [32], [33], [?].

Advanced climate control allows, besides efficient use of
energy and comfort management, integration within aggregation
schemes to provide ancillary services and portfolio management
in demand side management [?]. For example, in [36]
the aggregated flexibility of a cluster of buildings is used to
provide balancing services using an aggregate-and-dispatch
approach.

An alternative to LTI modelling is to use non-linear data-driven
models, such as artificial neural networks (ANNs) [?],
in combination with Dynamic Programming (DP) [?] to
compute a control policy. This form of control can be seen as a
form of RL [?].

2.2. Reinforcement Learning 

As discussed in Section 1, RL is a model-free control technique
whereby a control policy is learned from interactions with
the environment. A well-established reinforcement learning
method is Q-learning [40], where the state-action value function,
or Q-function, is learned. Compared to the techniques presented
in the previous section, RL mitigates the risk of model
bias [24], as a policy is built around the data. When considering
Q-learning and its applications to demand response, mainly
traditional Q-learning has been used [?], [42], [?]. More recently,
BRL [43], [44] in the form of FQI [?] has been investigated.
The main advantage of BRL is the practical learning time required
for convergence (20-40 days in [43], [44]), which comes
at the cost of an increased computational complexity. Although
BRL can rival the performance of MPC techniques, as indicated
in [?], the context of demand response allows adding
prior knowledge to the optimal control problem, which can result
in faster convergence. A first approach uses prior knowledge by
shaping the policy, obtained with FQI, by means of constrained
regression [22]. A second approach is described by Lampe et
al. in [24]. Here virtual data from a model is used together with
experimental data to obtain an approximation of the Q-function
(state-action value function).

Building upon [43], [22], [?], this work has the following
contributions:

• BRL in the form of FQI [?] in combination with virtual
trajectories [?] and policy shaping is applied to an
HVAC system for a typical objective of dynamic pricing.
This effectively results in a data-driven solution for
building climate control systems, combining state-of-the-art
BRL with domain knowledge;

• Quantitative and qualitative performance assessment of
MABRL in a simulated and experimental environment,
where the operation of an air conditioner is subject to
dynamic energy pricing.

3. Problem Formulation 

Before presenting the control approach in Section 4, this section
formulates the decision-making process as a Markov Decision
Process (MDP) [38], [?]. An MDP is defined by its state
space $X$, its action space $U$, and a transition function $f$:

$$x_{k+1} = f(x_k, u_k, w_k), \tag{1}$$

which describes the dynamics from $x_k \in X$ to $x_{k+1}$ under the
control action $u_k \in U$, and subject to a random process $w_k \in W$,
with probability distribution $p_w(\cdot, x_k)$. The reward accompanying
each state transition is $r_k$:

$$r_k(x_k, u_k, x_{k+1}) = \rho(x_k, u_k, w_k), \tag{2}$$

which is here considered to be a cost, as it accounts for the
energy price. Therefore, the objective is to find a control policy
$h : X \to U$ that minimises the $T$-stage cost starting from state
$x_1$, denoted by $J^h(x_1)$:

$$J^h(x_1) = \mathbb{E}\left( R^h(x_1, w_1, \ldots, w_T) \right), \tag{3}$$

with:

$$R^h(x_1, w_1, \ldots, w_T) = \sum_{k=1}^{T} \rho(x_k, h(x_k), w_k). \tag{4}$$


It is worth remarking that an optimal control policy, here denoted
by $h^*$, satisfies the Bellman optimality equation:

$$J^{h^*}(x) = \min_{u \in U} \; \mathbb{E}_{w \sim p_w(\cdot \mid x)} \left\{ \rho(x, u, w) + J^{h^*}(f(x, u, w)) \right\}. \tag{5}$$

Typical techniques to find policies in an MDP framework are
value iteration, policy iteration, and policy search [22]. As mentioned
earlier, in this work MABRL (related to value iteration)
is considered.

3.1. State description

Following the approach presented by Ruelens et al. in [25],
it is assumed that the state space $X$ consists of: time-dependent
state information $X_t$, controllable state information $X_{phys}$, and
exogenous (uncontrollable) state information $X_{ex}$:

$$X = X_t \times X_{phys} \times X_{ex}. \tag{6}$$

In the following, each component of the state space is detailed.

3.1.1. Timing

The time-dependent information component $X_t$ contains information
related to timing. In this implementation the quarter
hour during the day has been used:

$$X_t = \{1, \ldots, 96\}. \tag{7}$$

Extending the timing information with, e.g. the day of the week, can be done at little extra cost.
However, this extension is outside the scope of this work.

3.1.2. Physical representation

The controllable state information $x_{phys,k}$ consists of the indoor
air temperature $T_k$:

$$x_{phys,k} = T_k \mid \underline{T}_k < T_k < \overline{T}_k, \tag{8}$$

where $\underline{T}_k$ and $\overline{T}_k$ denote the lower and upper bound set by the
end consumer.

3.1.3. Exogenous Information

The exogenous (uncontrollable) information $x_{ex,k}$ is considered
to have an impact on $x_{phys,k}$, but it is not affected by the control
actions $u_k$. In this study the exogenous state information consists
of the outside temperature, $T_o$, and the solar radiance, $S$:

$$x_{ex,k} = (T_{o,k}, S_k). \tag{9}$$

In this work it is assumed that a forecast of the outside temperature
and the solar radiance is available when constructing the
policy $h$, as will be detailed in Section 4.2 ($\hat{\cdot}$ is used to denote
a forecast).

3.1.4. Control action

In this work the control action is a binary value indicating if
the HVAC system should switch ON or OFF:

$$u_k \in \{0, 1\}. \tag{10}$$

The control action of the previous control event, $u_{k-1}$, is also
added to the state information, as it is relevant for the dynamics
of the HVAC system. In fact, its value will be used to avoid too
frequent switching, as discussed in Section 3.2. As a result, the
final state vector is defined as:

$$x_k = (x_{t,k}, T_k, T_{o,k}, S_k, u_{k-1}). \tag{11}$$

As this state vector only contains part of the actual state of the
system, a common approach to enrich the state vector is to add
previous state tuples [47]. This, however, results in an increased
state dimension that could be reduced by means of feature extraction,
e.g. non-linear principal component analysis [48]. In
this work, however, the state vector is defined according to Eq. 11.

3.2. Backup controller and physical realisation

The HVAC system is assumed to be equipped with a backup
controller, which acts as a filter on the control actions resulting
from the policy $h$. The function $B : X \times U \to U$ maps the
requested control action $u_k$ taken in state $x_k$ to a physical control
action $u_k^{phys}$:

$$u_k^{phys} = B(x_k, u_k, \theta), \tag{12}$$

with $\theta$ containing system-specific information. In this case, $\theta$
contains $\underline{T}_k$ and $\overline{T}_k$, and $B(\cdot)$ is defined as:

$$B(x_k, u_k, \theta) = \begin{cases} 1 & \text{if } T_k < \underline{T}_k \\ 1 & \text{if } \underline{T}_k \le T_k \le \overline{T}_k \wedge u_{k-1} = 1 \\ u_k & \text{if } \underline{T}_k \le T_k \le \overline{T}_k \wedge u_{k-1} = 0 \\ 0 & \text{if } T_k > \overline{T}_k \end{cases} \tag{13}$$

3.3. Reward model

As discussed in the introduction, different applications can
be considered to harvest the flexibility related to climate control
of buildings. In dynamic pricing or energy arbitrage [?], i.e.
responding to an external price vector $\lambda$, the reward function is
defined as:

$$\rho(x_k, u_k^{phys}, \lambda_k) = P \, \Delta t \, \lambda_k \, u_k^{phys}, \tag{14}$$

where $P$ is the average power consumption of the air conditioner
during the time interval $\Delta t$.
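
The backup controller of Eq. 13 and the cost of Eq. 14 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the argument layout and the treatment of the comfort band as a closed interval are assumptions made here, not details taken from the implementation used in this work.

```python
def backup_controller(temp, u_request, u_prev, t_min, t_max):
    """Map a requested binary action to the physical action u_phys (cf. Eq. 13).

    temp      -- indoor temperature T_k
    u_request -- action requested by the policy (0 = OFF, 1 = ON)
    u_prev    -- action of the previous control event u_{k-1}
    t_min, t_max -- comfort bounds (assumed closed interval)
    """
    if temp < t_min:
        return 1            # below the lower bound: force the system ON
    if temp > t_max:
        return 0            # above the upper bound: force the system OFF
    if u_prev == 1:
        return 1            # already ON: keep going until t_max is reached
    return u_request        # otherwise follow the policy's request


def stage_cost(u_phys, price, power_kw, dt_hours):
    """Energy cost of one control interval, P * dt * price * u_phys (cf. Eq. 14)."""
    return power_kw * dt_hours * price * u_phys
```

For example, `backup_controller(21.0, 0, 1, 19.0, 23.0)` overrides the OFF request because the unit was already running, which is the mechanism that limits frequent switching.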
Figure 1: Overview of the information flow in this implementation of MABRL using FQI and policy shaping.


4. Model-Assisted Batch Reinforcement Learning 


As discussed in Section 2.2, the control policy $h^*$ is obtained
using MABRL in combination with policy shaping, as illustrated
in Figure 1. To this end, FQI is used to obtain an approximation
$\hat{Q}^*$ of the state-action value function $Q^*$ from a batch
of four-tuples $\mathcal{F}$, as detailed in [?]:

$$\mathcal{F} = \{(x_l, u_l, r_l, x'_l),\ l = 1, \ldots, \#\mathcal{F}\}, \tag{15}$$

where $x'_l$ denotes the successor state in time to $x_l$.

As illustrated in Figure 1, $\mathcal{F}$ is a combination of experimentally
observed tuples $\mathcal{F}^e$ and virtual tuples $\mathcal{F}^v$ generated by a
model:

$$\mathcal{F} = \mathcal{F}^e \cup \mathcal{F}^v. \tag{16}$$

From the resulting $\hat{Q}^*(x, u)$ a control action $u_k$ can be obtained
following

$$u_k \in \arg\min_{u} \hat{Q}^*(x_k, u). \tag{17}$$


The following subsections detail the algorithms behind each
block in Figure 1.

4.1. The support model: Artificial Neural Network

Following the approach presented in [24], an ANN is used
to represent a support model. In this work, single-layer,
single-output Extreme Learning Machines (ELMs) are trained to
predict the change of internal temperature, $\Delta T$. These are used
as they allow for fast training of the weights of the network at
the expense of reduced regression performance. The latter is
partially mitigated by combining multiple ELMs in an ensemble [?].

The output of an ELM with $p$ input neurons, $n$ hidden neurons
and one output node can be formulated as:

$$y(x) = \sum_{i=1}^{n} \beta_i \, g(w_i \cdot x, b_i) = G(x)\beta, \tag{18}$$

where $x \in \mathbb{R}^p$ is the input vector, $w_i \in \mathbb{R}^p \sim$ i.i.d. $U(-1, 1)$
is the weight vector connecting the input nodes with the $i$-th
hidden node, $b_i \in \mathbb{R} \sim U(0, 1)$ is the bias of the $i$-th hidden
node, and $\beta_i \in \mathbb{R}$ is the output weight of the $i$-th hidden
node. $\beta = [\beta_1 \cdots \beta_n]^T \in \mathbb{R}^n$ is the vector of the output
weights between the hidden neurons and the output node.
$G(x) = [g(w_1 \cdot x, b_1) \ \ldots \ g(w_n \cdot x, b_n)] \in \mathbb{R}^{1 \times n}$ is the output
vector of the hidden layer, where the nodes' activation function
$g$ is a sigmoid. As the parameters of the hidden nodes (bias and
input weight) are randomly generated, training an ELM corresponds
to determining the output weight vector $\beta$ based on the
least-squares solution of $G\beta = Y$:

$$\beta = \left( \frac{I}{C} + G^T G \right)^{-1} G^T Y. \tag{19}$$

In Eq. 19 the term $Y \in \mathbb{R}^{m \times 1}$ contains the target values from
the training set, $Y = [y_1, y_2, \ldots, y_m]^T$, where $m$ is the size of the
training set; $G \in \mathbb{R}^{m \times n}$ is the network output matrix, constructed
as:

$$G = \begin{bmatrix} g(w_1 \cdot x_1, b_1) & \cdots & g(w_n \cdot x_1, b_n) \\ \vdots & g(w_i \cdot x_j, b_i) & \vdots \\ g(w_1 \cdot x_m, b_1) & \cdots & g(w_n \cdot x_m, b_n) \end{bmatrix}, \quad i \in \{1, \ldots, n\},\ j \in \{1, \ldots, m\}, \tag{20}$$

while the regularisation term $C = 100$ is used to enhance the
robustness and the generalisation of the solution [?].

This study uses an ensemble of 40 single-output ELMs; the
ensemble model output $y_{ANN}(x)$ is given by the equally weighted average
of the $L$ individual ELM outputs:

$$y_{ANN}(x) = \frac{1}{L} \sum_{k=1}^{L} y_k(x). \tag{21}$$

The input vector at time $k$ is $x_k = (x_{t,k}, T_k, T_{o,k}, S_k, u_{k-1})$, and
the output is $y_k = \Delta T_k = T_{k+1} - T_k$. A detailed description
of ELMs can be found in [52]. As the training process is fast,
finding the appropriate number of hidden nodes is done using
cross-validation. In the experimental trials, further detailed in
Sec. 5.3, the ELMs' regularization term is fixed to $C = 100$, and
the ensemble of ELMs consists of 40 networks.

Algorithm 1 Model-assisted Fitted Q-iteration using a forecast
of the exogenous data

Input: $\mathcal{F}^e$, forecast $\hat{x}_{ex}$, $y_{ANN}$, $r$, $n$, $H$, $\lambda$, $\theta$
1: $\mathcal{F}^v \leftarrow$ generateTuples($\mathcal{F}^e$, $y_{ANN}$, $r$, $n$, $H$, $\theta$)
2: $\mathcal{F} \leftarrow \mathcal{F}^e \cup \mathcal{F}^v$
3: let $\hat{Q}_0$ be zero everywhere on $X \times U$
4: for $N = 1, \ldots, T$ do
5:   for $l = 1, \ldots, \#\mathcal{F}$ do
6:     $r_l \leftarrow \rho(x_l, u_l^{phys}, \lambda)$
7:     $\hat{x}'_l \leftarrow x'_l$ with $x'_{ex,l}$ replaced by its forecast $\hat{x}'_{ex,l}$
8:     $Q_{N,l} \leftarrow r_l + \min_{u \in U} \hat{Q}_{N-1}(\hat{x}'_l, u)$
9:   end for
10:  use regression to obtain $\hat{Q}_N$ from $\{((x_l, u_l), Q_{N,l}),\ l = 1, \ldots, \#\mathcal{F}\}$
11: end for
12: return $\hat{Q}^* = \hat{Q}_T$
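
To make the training step of Eq. 19 concrete, the following dependency-free Python sketch fits a single-output ELM by solving the regularised normal equations and averages several ELMs as in Eq. 21. It is a minimal illustration under stated assumptions: the class and function names are hypothetical, the activation is assumed to be a sigmoid of $w_i \cdot x + b_i$, and a small Gaussian-elimination routine stands in for a linear algebra library.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                           # back-substitution
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


class ELM:
    """Single-output ELM with a random, untrained hidden layer."""

    def __init__(self, n_inputs, n_hidden, c_reg=100.0, rng=None):
        rng = rng or random.Random()
        # Hidden parameters are drawn once: w_i ~ U(-1, 1), b_i ~ U(0, 1).
        self.w = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
        self.b = [rng.uniform(0, 1) for _ in range(n_hidden)]
        self.c_reg = c_reg
        self.beta = [0.0] * n_hidden

    def hidden(self, x):
        """Row vector G(x); assumes g(w.x, b) = sigmoid(w.x + b)."""
        return [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(self.w, self.b)]

    def fit(self, xs, ys):
        g = [self.hidden(x) for x in xs]                   # m x n matrix G
        n = len(self.b)
        # Normal equations of Eq. 19: (I/C + G^T G) beta = G^T Y.
        a = [[sum(row[i] * row[j] for row in g)
              + (1.0 / self.c_reg if i == j else 0.0)
              for j in range(n)] for i in range(n)]
        rhs = [sum(row[i] * y for row, y in zip(g, ys)) for i in range(n)]
        self.beta = solve_linear(a, rhs)
        return self

    def predict(self, x):
        return sum(bi * gi for bi, gi in zip(self.beta, self.hidden(x)))


def ensemble_predict(models, x):
    """Equally weighted average of the individual ELM outputs (cf. Eq. 21)."""
    return sum(m.predict(x) for m in models) / len(models)
```

Because only the output weights are fitted, each ELM costs a single linear solve, which is what makes cross-validating the number of hidden nodes and training an ensemble of 40 networks cheap.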


4.2. Fitted Q-Iteration 

A popular BRL technique that found its way into several
practical implementations is FQI [49]. Typically, BRL techniques
construct policies based on a batch of tuples. However,
since in this context the reward function is known a priori and
the resulting actions of the backup controller can be measured,
Algorithm 1 uses tuples of the form $(x_l, u_l, x'_l, u_l^{phys})$.
Algorithm 1 shows how FQI [49] can be used in a demand response
application when a forecast of the exogenous data is available.
Here $x'_l$ denotes the successor state to $x_l$. In Algorithm 1, the
observed external temperature and solar radiance in $x'_{ex,l}$ are
replaced by their forecasted value $\hat{x}'_{ex,l}$ (line 7 in Algorithm 1). As
such, $\hat{Q}^*$ becomes biased towards the provided forecast. Before
constructing $\hat{Q}^*$, a set $\mathcal{F}^v$, containing at most $n$ virtual tuples,
is created with random state-action pairs. However, a randomly
generated tuple is only accepted into $\mathcal{F}^v$ if the nearest experimental
tuple in $\mathcal{F}^e$ falls outside a predefined radius $r$, following
a distance metric $\Delta$*. In order to keep Algorithm 2 tractable in
the event of a dense set $\mathcal{F}^e$, a computational budget $H$ is added.

The distance metric is defined as: $\Delta((x, x'), (u, u')) =
\|x - x'\| + \|u - u'\|$, where $\|\cdot\|$ is the Euclidean norm.
Algorithm 2 uses the artificial neural network $y_{ANN}$ to generate the virtual
tuples, as in [?].

Similarly to [49], Algorithm 1 uses an ensemble of extremely
randomized trees [?] as regression algorithm to estimate
the Q-function.

In principle, other regression algorithms, such as artificial
neural networks or support vector machines, can be applied.

* Note that this radius can also be defined based upon local inter-tuple distances.
This is, however, to be explored in future work.


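Algorithm 2 generateTuples($\mathcal{F}^e$, $y_{ANN}$, $r$, $n$, $H$, $\theta$)
Input: $\mathcal{F}^e$, $y_{ANN}$, $r$, $n$, $H$, $\theta$
1: $\mathcal{F}^v \leftarrow \emptyset$
2: $N \leftarrow 0$
3: while $\#\mathcal{F}^v < n$ and $N < H$ do
4:   generate random state-action sample $(x_k, u_k)$
5:   $d \leftarrow \min_{(x, u) \in \mathcal{F}^e} \{ \|x - x_k\| + \|u - u_k\| \}$
6:   if $d > r$ then
7:     $u_k^{phys} = B(x_k, u_k, \theta)$
8:     $T'_k \leftarrow y_{ANN}(x_k, u_k^{phys}) + T_k$
9:     $\mathcal{F}^v \leftarrow \mathcal{F}^v \cup \{(x_k, u_k, x'_k, u_k^{phys})\}$
10:  end if
11:  $N \leftarrow N + 1$
12: end while
13: return $\mathcal{F}^v$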
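
The acceptance rule of Algorithm 2 can be sketched in Python as follows. The helper callables `model`, `backup` and `sample_pair` are hypothetical placeholders for the support model, the backup controller and the random state-action sampler; in particular, `model` is assumed here to return the successor state directly rather than the temperature change that is added to $T_k$ in line 8.

```python
import math
import random


def generate_virtual_tuples(exp_pairs, model, backup, sample_pair,
                            radius, n_max, budget, rng=random):
    """Collect at most n_max virtual tuples away from the experimental data.

    exp_pairs   -- list of (x, u) with x a tuple of floats and u in {0, 1}
    model       -- hypothetical callable (x, u_phys) -> successor state
    backup      -- hypothetical callable (x, u) -> physical action
    sample_pair -- hypothetical callable rng -> random (x, u)
    """
    virtual, attempts = [], 0
    while len(virtual) < n_max and attempts < budget:
        x, u = sample_pair(rng)
        # Distance to the nearest experimental pair: ||x - x'|| + |u - u'|.
        d = min(math.dist(x, xe) + abs(u - ue) for xe, ue in exp_pairs)
        if d > radius:                     # only fill sparsely sampled regions
            u_phys = backup(x, u)
            virtual.append((x, u, model(x, u_phys), u_phys))
        attempts += 1                      # the budget bounds the run time
    return virtual
```

The budget matters when the experimental batch is dense: almost every random sample is then rejected, and without the attempt counter the loop would not terminate in reasonable time.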


4.3. Policy Shaping 

This section shows how to shape a policy $h^*$ by using triangular
Membership Functions (MFs) [22] and expert knowledge
to enforce monotonicity in the policy.

The centers of the triangular MFs are located on an equidistant
grid with $N_g$ MFs along each dimension of the state space.
This partitioning leads to $N_g^{|X|}$ state-dependent MFs for each
action.

The parameter vector $\theta^*$ that approximates the original policy
can be found by solving the following least-squares problem:

$$\theta^* \in \arg\min_{\theta} \sum_{l=1}^{\#\mathcal{F}} \left( [F(\theta)](x_l) - h(x_l) \right)^2, \quad \text{s.t. expert knowledge}, \tag{22}$$

where $F$ denotes an approximation mapping of a weighted linear
combination of triangular MFs and $[F(\theta^*)](x)$ denotes the
policy $F(\theta^*)$ evaluated at state $x$. A more detailed description
of how these triangular MFs are defined can be found in [22].


4.4. Policy dispatch 

The policy dispatch block is in charge of operating the air
conditioner according to the policy coming from the MABRL
controller. The air conditioner's controller either selects each action in
any encountered state with nonzero probability (exploration),
or dispatches the optimal control action from the policy by
exploiting the acquired knowledge (exploitation) [40]. A common
technique to balance exploration and exploitation in RL is
ε-greedy exploration:

$$u_k = \begin{cases} u \sim \mathrm{Bernoulli}(0.5) & \text{if } \gamma \le \varepsilon_j \\ [F(\theta^*)](x_k) & \text{if } \gamma > \varepsilon_j \end{cases} \tag{23}$$

where $\gamma \sim U(0, 1)$.

The following section presents the performance assessment
of MABRL in a simulated and a real environment, where the
exploration factor $\varepsilon_j$ is designed to reduce by half every four days:

$$\varepsilon_j = \varepsilon_0 \, \frac{g}{g + j - 1}. \tag{24}$$

In Eq. 24 the decay factor $g$ is 4, the initial probability $\varepsilon_0$ is
0.4, and the day index is $j \in \{1, 2, 3, \ldots\}$.
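
The dispatch rule can be illustrated with the short Python sketch below. The schedule coded here, $\varepsilon_j = \varepsilon_0 \, g / (g + j - 1)$, is an assumption of this sketch chosen to be consistent with the stated values ($g = 4$, $\varepsilon_0 = 0.4$, halving after the first four days); the function names are likewise hypothetical.

```python
import random


def exploration_probability(day, eps0=0.4, g=4):
    """Assumed schedule: eps_j = eps0 * g / (g + j - 1) for day j = 1, 2, ..."""
    return eps0 * g / (g + day - 1)


def dispatch_action(policy_action, day, rng=random):
    """Explore with probability eps_j, otherwise follow the shaped policy."""
    if rng.random() <= exploration_probability(day):
        return rng.randint(0, 1)       # Bernoulli(0.5) exploratory action
    return policy_action               # exploit the current policy
```

Under this assumed schedule the exploration probability starts at 0.4 on day 1 and has dropped to 0.2 by day 5, so most early actions still follow the policy while the batch keeps gaining coverage of both actions.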


5. Performance assessment 


This section provides a qualitative and quantitative performance
assessment of the controller discussed in Section 4.2.
Section 5.2 presents a quantitative assessment in a setting where
the online part in Figure 1 is simulated via an equivalent thermal
parameter model [53] of the air conditioner and the building
(further detailed in Appendix A). Section 5.3 provides a
qualitative analysis of MABRL performance in a real climate
control application.

5.1. Policy representation

In the following paragraphs, a graphical representation of the
control policy is used to study the impact of integrating forecasts,
applying policy shaping and adding virtual tuples. In order
to make a two-dimensional visualization, each control policy
is depicted for the forecasted outside temperature and solar
radiation. As such, each two-dimensional mapping in Figure 4
maps the current quarter hour of the day and the indoor temperature to
a binary control action. Switching the HVAC system ON is represented
by black areas and switching the HVAC system OFF by white
areas. Note that the original control policy is denoted by $h(x)$ and
the shaped policy is denoted by $[F(\theta^*)](x)$. Especially for the
experimental results, analyzing the policy is relevant, as it gives
offline insight into what the algorithm has learned and allows
evaluating the contribution of different features of the control
algorithm without having to reproduce exactly the same measurement
conditions, for example the effects of policy shaping,
of adding virtual tuples, and the sensitivity of the policy to
different forecasts of external conditions.

5.2. Simulated environment

In order to test the convergence and the performance of the
MABRL controller, a benchmark is required. As the experimental
setup is a living lab, exact external conditions cannot
be reproduced from day to day. Thus, the model described
in Appendix A has been used to simulate the air conditioner
and the lab room. An optimal solution to the thermal scheduling problem
obtained using MPC is taken as benchmark and is depicted in
Figure 2. This benchmark solution is explained in more detail
in Appendix B.

The simulations have been performed using different exogenous
information and price profiles from day to day [55]. In
Figure 2, the top graph shows the cumulative cost of the different
controllers: BRL, MABRL, Optimal (MPC), and default
thermostatic control.

These results indicate that the control approach
presented in Section 4 is indeed able to find near-optimal control policies
in a learning time of approximately 20 days, after which
the performance relative to the mathematical optimum is stable.
However, adding virtual tuples, as illustrated in the procedure
in Section 4.2, has limited contribution, with a slight economic
advantage of the MABRL approach over the BRL. The following
subsection presents the policy computation on the basis of
experimental data, together with its experimental validation in
a living lab.

Figure 2: Cost performance of the different controllers: BRL (no virtual tuples),
MABRL, Optimal (MPC), and Default (hysteresis). Top plot: cumulative
electricity cost. Bottom plot: daily economic performance.


5.3. Experimental environment 

The setup consists of two air conditioners (Fig. 3), one temperature
sensor to measure the room temperature at one point,
one pyranometer to measure the solar radiation on the roof, one
temperature sensor to measure the external air temperature, and
one power meter to measure the air conditioners' power consumption.
Data points are collected every 5 minutes.

External forecasts are provided three times a day with a granularity
of 15 minutes and a prediction horizon of 8 hours. The
policy is recomputed three times a day, as new forecasts arrive,
and dispatched on a 5-minute basis.


5.3.1. Ability to integrate forecasts 

A first analysis focuses on the ability of the control approach
to effectively take into account the forecast of exogenous information.
To this end, Figure 4 shows policies organized in a
matrix for different forecasts of the external temperature, where
each column corresponds to the same forecasted external temperature
and where the number of experimental tuples increases
from top to bottom. No virtual tuples have been added
to the batch. Starting from the top row, policies are computed
using experimental batches of two days, eight days, and sixteen
days. Note that, recalling Eq. 13, once the heating system
has been switched on it continues heating until the temperature
upper bound is reached, in order to avoid frequent switching events.
When considering the first row, with only two days' worth of
data in the batch, the policies obtained are nearly invariant to the
temperature forecasts used. This is attributed to the fact that the
variation in the observed outside temperatures is limited. The
policies depicted in the second row, and certainly in the third
row, show significantly more dependency on the forecasted outside
temperatures, as the batch of observed tuples contains more
variation in terms of outside temperature. Considering the policies
depicted in the third row, it can be observed that the policy
corresponding to the lower forecasted outside temperature (second
column) is less responsive to low energy prices, whilst the
policy obtained for a high forecasted outside temperature (third
column) is more responsive. This is inferred from the policy
advising the HVAC system to switch ON only for a small region
of the state space in the second column, compared to the third
column. This is meaningful, as at a low outside temperature,
energy stored in the room (through an increased temperature)
is lost faster, preventing the HVAC system from avoiding high
energy prices.

Figure 4: Policy projections obtained from experimental data for different forecasts of the outside temperature. The forecasted outside temperature is the same for
all policies in each column and is depicted in the lower row. The different policies in each row have been constructed with different amounts of experimental data.
From top to bottom, batches containing experimental tuples from 2, 8 and 16 days have been used. The lowest row indicates the forecasts of outside temperature
for which the policies have been calculated.

5.3.2. Policy shaping 

Figure 5 shows the effect of shaping the policy as discussed
in Section 4.3. Here triangular membership functions [22] have
been used with the constraint that the policy needs to be strictly
decreasing with increasing indoor temperature. This is a direct
consequence of the physical understanding that a room at a
higher temperature is subject to higher losses to the environment.
In the upper row of Figure 5, the original policies ($h(x)$)
are depicted, whilst the second row shows the shaped policies
($[F(\theta^*)](x)$). It can be observed that the shaping results in gaps
in the policy being filled. These gaps are attributed to a
non-uniform sample density in the batch and the random nature of
the extra-trees regression algorithm [49].

5.3.3. Effect of adding virtual tuples 

Figure 6 shows the impact of virtual tuples on the convergence
of the policy. The policies depicted in Figure 6 are computed
using different proportions of experimental data and virtual
tuples. The left graph in Figure 6 shows the policy computed
with 2 days of experimental tuples and no virtual tuples,
which is called the early policy.

The right graph shows the policy computed using a large set
of experimental data (20 days), without virtual tuples, called the
regime policy; this policy is considered to be the best possible
policy that can be calculated using the data. The middle graph
shows the policy computed using 2 days of experimental tuples
and 2 days of virtual tuples from the support model, which is
trained using the 2 days of experimental data. The latter is
called the model-assisted policy. One can recognize that the
model-assisted policy is more similar to the regime policy than the
early policy is (94% overlap with the regime policy for the
model-assisted policy compared to 90% overlap for the early policy).
For example, around time steps 27, 42 and 78 the model-assisted
policy more closely resembles the regime policy; this is attributed to
the virtual tuples.

Figure 5: Illustrated effect of policy shaping. Original form (row n. 3 from Figure 4), in the upper row, versus shaped policies, in the lower row.

Figure 6: Computation of the closed-loop policies for different shares of virtual tuples over experimental data. The left graph shows the early policy, obtained
from 2 days of experimental tuples, while the right graph shows the regime policy, obtained from 20 days of experimental tuples. The middle graph shows the
model-assisted policy obtained using 2 days of experimental and virtual tuples.

5.3.4. Power profiles 

Finally, a more direct indication of the performance of the
approach presented in this work is illustrated in Figures 7 and
8. These figures show the results of implementing the control
approach after 12 and 16 days. The outside temperature during
the experiment is depicted in the top graph. The middle graph
shows the actual power consumption of the HVAC system and
the energy price. The bottom graph depicts the internal temperature
and the control policy. From both figures, it can be observed
that the HVAC typically switches on at the beginning of
a low price period, largely avoiding subsequent high prices.
This demonstrates the efficacy of the model-free control method presented
in this work for two distinct outside temperature regimes.

Figure 3: Artist impression of the experimental setup. Depicted are the two
HVAC units and the sensors. Dimensions: 7.9 m (L), 7.8 m (W), 5 m (H).


6. Conclusions 


In this work, model-assisted batch reinforcement learning
has been deployed to harvest the flexibility related to climate
control. The results from a quantitative analysis using a
simulated environment showed that a performance within 90% of
a mathematical optimum is obtained within approximately 20
days. These results have been confirmed qualitatively after
deploying the proposed control approach to a living lab. After
collecting data for approximately 16 days the control approach was
able to generalize policies for different forecasts of the outside
temperature. A policy shaping method has been used to shape
the policies by enforcing that the policies need to be strictly
decreasing with an increasing indoor temperature. Furthermore,
qualitative results indicate that adding virtual tuples from a
support model can improve the quality of the control policies when
the number of experimental tuples in the batch is small.
However, adding virtual tuples resulted only in a limited increase of
the performance.

Future work will be aimed towards evaluating the presented 
algorithm for different objectives related to demand response. 
Furthermore, other types of models, such as gray-box models, 
will be used to add virtual tuples. 


Table A.1: ETP model parameters

Parameter    Value
$R_a$        no °C/kW
$C_a$        2.5E+06 kWh/°C
$R_m$        2000 °C/kW
$C_m$        1.2E+01 kWh/°C
$A_s$        0.5
$A_c$        1
Appendix A. ETP model

The data batch of the simulated experiments is provided by
an Equivalent Thermal Parameter (ETP) model [54] that is fitted
on experimental data from the living lab. The heat flow for the
ETP model of a residential heating/cooling system is defined as
follows:

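$$
\begin{aligned}
\dot{T}_a &= \frac{1}{C_a}\left[\frac{1}{R_a}(T_o - T_a) + \frac{1}{R_m}(T_m - T_a) + A_s Q_s + A_c Q_{ac}\right] \\
\dot{T}_m &= \frac{1}{C_m}\left[\frac{1}{R_m}(T_a - T_m) + (1 - A_s) Q_s\right]
\end{aligned}
\tag{A.1}
$$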

where $C_m$ is the thermal mass of the building envelope, $T_o$
is the outside air temperature, $T_a$ is the inside air temperature
and $T_m$ is the envelope temperature. $R_m$ is the resistance
between the inner air and the envelope, while $C_a$ represents the
thermal mass of the air. The heat flux into the interior air mass
is given by a fraction $A_s$ of the solar heat gain $Q_s$ and a
fraction $A_c$ of the heat gains of the air conditioners $Q_{ac}$.
The heat flux to the building envelope is given by the thermal
exchange with the inner air and by the remaining fraction
$(1 - A_s)$ of the solar heat gain.
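
To make the two-state dynamics concrete, the following Python sketch integrates the ETP equations with a forward-Euler step. It is an illustration under stated assumptions, not the simulator used in this work: the function names, the step size and all parameter values in `PARAMS` are hypothetical and were chosen only to give a numerically stable example (resistances in °C/kW, thermal masses in kWh/°C, heat gains in kW, time in hours).

```python
def etp_step(t_air, t_mass, t_out, q_solar, q_ac, p, dt_h):
    """Advance air and envelope temperatures by one forward-Euler step of dt_h hours."""
    # Heat balance of the interior air: exchange with outside air and envelope,
    # plus the air-side shares of the solar and air-conditioner heat gains.
    d_air = ((t_out - t_air) / p["Ra"] + (t_mass - t_air) / p["Rm"]
             + p["As"] * q_solar + p["Ac"] * q_ac) / p["Ca"]
    # Heat balance of the envelope: exchange with the interior air plus the
    # remaining share of the solar heat gain.
    d_mass = ((t_air - t_mass) / p["Rm"] + (1.0 - p["As"]) * q_solar) / p["Cm"]
    return t_air + dt_h * d_air, t_mass + dt_h * d_mass


# Hypothetical parameters (assumed for illustration, not the fitted values).
PARAMS = {"Ra": 2.0, "Ca": 2.0, "Rm": 0.5, "Cm": 20.0, "As": 0.5, "Ac": 1.0}


def simulate(t_air, t_mass, t_out, q_solar, q_ac, steps, dt_h=0.25, p=PARAMS):
    """Return the list of (air, envelope) temperatures under constant inputs."""
    trajectory = [(t_air, t_mass)]
    for _ in range(steps):
        t_air, t_mass = etp_step(t_air, t_mass, t_out, q_solar, q_ac, p, dt_h)
        trajectory.append((t_air, t_mass))
    return trajectory


# Four hours of constant heating on a cold day, starting from 18 °C.
heated = simulate(18.0, 18.0, t_out=5.0, q_solar=0.0, q_ac=10.0, steps=16)
print(heated[-1])
```

Forward Euler is used here only for brevity; the step size must stay small relative to the fastest time constant of the model, otherwise a stiff-aware integrator would be the safer choice.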



Figure A.9: Sketch of the equivalent thermal parameter model. 


Appendix B. Benchmark 

Figure 7: Experimental results 1. Top plot: outside temperature. Middle plot: consumed power by the HVAC and electricity price. Bottom plot: control policy
obtained with MABRL and the resulting indoor air temperature.

An optimal solution of the considered HVAC control problem
can be found using a mixed-integer linear programming
solver. The objective of the HVAC controller is to minimize its
electricity cost using known prices $\lambda \in \mathbb{R}^T$:

$$
\min \sum_{t=1}^{T} P \lambda_t u_t^{\text{phys}} \Delta t \tag{B.1}
$$

subject to:

$$
x_{t+1} = f(x_t, u_t, w_t), \qquad u_t^{\text{phys}} = B(\cdot)
$$


where $f$ is the equivalent thermal parameter model defined by
(A.1) and $B$ is the backup controller defined by (13). Notice
that the optimal controller knows the equivalent thermal
parameter model, the settings of the backup controller and the future
disturbances. An optimal solution of this optimization problem
was found with the mixed-integer linear programming
solver Gurobi [56].
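
The structure of the benchmark can be illustrated on a toy instance. The Python sketch below does not use a mixed-integer solver; it swaps in exhaustive enumeration of all on/off schedules, which reaches the same optimum for a very short horizon. It also replaces the ETP model with a hypothetical first-order thermal model and replaces the backup controller with hard comfort bounds. All names, parameters and prices are assumptions made for illustration.

```python
from itertools import product


def simulate_cost(actions, prices, t_start, t_out, power_kw=2.0, dt_h=1.0,
                  decay=0.9, gain=2.0, t_min=19.0, t_max=23.0):
    """Return the electricity cost of an on/off schedule, or None if comfort is violated."""
    temp, cost = t_start, 0.0
    for u, price in zip(actions, prices):
        # Hypothetical first-order thermal model standing in for f in (B.1).
        temp = decay * temp + (1.0 - decay) * t_out + gain * u
        # Hard comfort bounds stand in for the backup controller (assumption).
        if not t_min <= temp <= t_max:
            return None
        cost += price * power_kw * u * dt_h
    return cost


def optimal_schedule(prices, t_start, t_out):
    """Enumerate every binary schedule and keep the cheapest feasible one."""
    best_actions, best_cost = None, float("inf")
    for actions in product((0, 1), repeat=len(prices)):
        cost = simulate_cost(actions, prices, t_start, t_out)
        if cost is not None and cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


# Hypothetical hourly prices with three cheap slots.
prices = [0.1, 0.1, 0.3, 0.3, 0.1, 0.3]
schedule, cost = optimal_schedule(prices, t_start=21.0, t_out=10.0)
print(schedule, round(cost, 3))
```

Enumeration grows as 2^T with the horizon length T, which is why a dedicated mixed-integer solver is the practical choice for horizons of realistic length.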